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Abstract 

Background: Many computational approaches have been developed to detect protein complexes from protein- 
protein interaction (PPI) networks. However, these PPI networks are always built from high-throughput experiments. 
The presence of unreliable interactions in PPI network makes this task very challenging. 

Methods: In this study, we proposed a Genetic-Algorithm Fuzzy Naive Bayes (GAFNB) filter to classify the protein 
complexes from candidate subgraphs. It takes unreliability into consideration and tackles the presence of unreliable 
interactions in protein complex. We first got candidate protein complexes through existed popular methods. Each 
candidate protein complex is represented by 29 graph features and 266 biological property based features. GAFNB 
model is then applied to classify the candidate complexes into positive or negative. 

Results: Our evaluation indicates that the protein complex identification algorithms using the GAFNB model 
filtering outperform original ones. For evaluation of GAFNB model, we also compared the performance of GAFNB 
with Naive Bayes (NB). Results show that GAFNB performed better than NB. It indicates that a fuzzy model is more 
suitable when unreliability is present. 

Conclusions: We conclude that filtering candidate protein complexes with GAFNB model can improve the 
effectiveness of protein complex identification. It is necessary to consider the unreliability in this task. 



PROTEOME 
SCIENCE 



Background 

A protein complex is a group of two or more associated 
polypeptide chains. Proteins in a protein complex are 
linked by non-covalent protein-protein interactions 
(PPIs) and together participate a certain biological process. 
[1]. Protein complexes are a cornerstone of many biologi- 
cal processes and together they perform a vast array of 
biological functions [1]. So identifying protein complexes 
is crucial to understand the principles of cellular organiza- 
tion and predicting protein functions. 

A number of computational methods can be used to 
detect protein complexes from a PPI network [2], a gra- 
phical map of an entire organism's interactome which is 
constructed from PPI knowledge base by considering 
individual proteins as nodes, and the existence of a 
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physical interaction between a pair of proteins as a link. 
For example, CMC (clustering-based on maximal cli- 
ques) [3] discovers complexes from the weighted PPI 
network based on the maximal cliques. COACH [4] is a 
core- attachment[5] based method to detect protein 
complexes from PPI networks, where protein-complex 
cores from the neighbourhood graphs are mined and 
then formed protein complexes by including attachments 
into cores. Many graph-clustering methods can obtain a 
number of candidate protein complexes. However, the 
precision of these existing methods are only nearly 0.4. 
They got many false positive protein complexes in their 
results. Hence, classifying the true protein complexes 
from these results is a best way to improve the perfor- 
mance of protein complex detection methods. L. Chen, 
et al try to classify the protein complexes from candidate 
subgraphs with enriched features [6], where each protein 
complex is represented with a feature vector derived 
from the corresponding complex graph and biological 
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properties of the constituent proteins. However, the 
current PPI knowledge base generally is built from high- 
throughput techniques, such as mass spectrometry and 
yeast two-hybrid assays. The PPI information gathered 
can be unreliable and incomplete [7]. The common classi- 
fiers may have limitedness due to the presence of noise 
in PPI network. To address the noise issue, our previous 
method [8] proposed a genetic algorithm fuzzy Naive 
Bayes (GAFNB) model to do the classification. For 
improving the performance of identifying protein com- 
plexes, here we integrated GAFNB model as a filter in 
the process of protein complexes detection. We first got 
candidate protein complexes based on existed protein 
complexes detection methods. Each candidate subgraph 
is represented by a feature vector that includes 29 graph 
features and 266 biological property based features [6]. 
Then the genetic algorithm fuzzy Naive Bayes (GAFNB) 
model is trained to classify candidate protein complexes 
into positive or negative using positive protein complexes 
determined through experiments and negatives generated 
randomly. After filtering the results of protein complexes 
detection methods through GAFNB model, the precision 
of existing methods are improved. 

The rest of the paper is organized as follows: In the 
method section, we present a general framework of 
identifying protein complexes with our GAFNB filter. 



The experiment is described next. Experimental results 
and discussion are presented at last. 

Methods 

For a given organism, the proposed protein complex 
identification approach contains two steps (Figure 1). 
First step is to detect candidate protein complexes 
through some state-of-the-art protein complex detection 
algorithms. Second step is to filter the candidate protein 
complexes by GAFNB model. In the following, we first 
introduce the two state-of-the-art protein complex 
detection algorithms for identifying protein complexes. 
Then detailed GAFNB model is presented. 

Candidate protein complexes identification algorithms 

We implement two of the state-of-the-art protein complex 
identification algorithms here: COACH and CMC. The 
results of these existing computational methods provide 
candidate protein complexes for filtering. 

COACH[4] is a core-attachment[5] based method to 
detect protein complexes from PPI networks, where pro- 
tein-complex cores from the neighbourhood graphs are 
mined and then formed protein complexes by including 
attachments into cores. Proteins within the same protein- 
complex core detected by this method have high func- 
tional similarity and tend to be co-localized. 
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Figure 1 Flowchart of our method. First step is to detect candidate protein complexes through some state-of-the-art protein complex 
detection algorithms. Second step is to filter the candidate protein complexes by GAFNB model. 
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CMC [3] discovers complexes from the weighted PPI 
network based on the maximal cliques. It first uses an 
iterative scoring method (AdjustCD) to assign weight to 
protein pairs, and the weight of a protein pair indicates 
the reliability of the interaction between the two proteins. 
Then generates all the maximal cliques from the 
weighted PPI networks. Finally removes or merges highly 
overlapped clusters based on their interconnectivity to 
get protein complexes. 

GAFNB model 

After obtaining the candidate protein complexes, we need 
a filter to classify the candidates into positive or negative. 
However, the feature values in candidates are unreliable 
because PPIs are generally obtained from high-throughput 
experiments. The traditional classifier may not be suitable 
for this task. Our previous work has shown GAFNB 
can handle unreliable information in features [8- 11]. Hence 
we applied GAFNB model to filter the candidates for 
improving performance of protein complexes detection. 

A. Fuzzy Certain Feature Membership (FCFM) 

Because PPI data has some false positives and false 
negatives, the features value of protein complexes based 
on the PPI data can be uncertain. For example, the density 
of candidate protein complexes is calculated by edges 
(PPI) in the subgraph. So the density value is not certain. 
We call such features as Fuzzy Features. In contrast the 
reliable features are Certain Features. The uncertainty 
about the values for a feature can be represented as a 
matrix, wherein an element is the membership of a 
Certain feature value in a Fuzzy value (Table 1). As shown 
in Table 1, Xu is the membership of Certain feature value 
of density (<0.5) in Fuzzy feature value ( <0.5 ). The 
matrices for all features are orthogonal. We refer to such 
matrices as Fuzzy Certain Feature Membership (FCFM). 

B. GAFNB model 

Let P = {p a } be a set of candidate protein complexes with 
features A = {ai} and D = {d k } a set of classes for candi- 
date protein complexes. Let / be a subset of S represents 
some candidate protein complexes. The model is trained 
in the following steps using positive protein complexes 
determined through experiments and negatives generated 
randomly. 

1) Compute probabilities of certain feature values We 
define the conditional probability of = for class d k 
using Laplace correction [12] as follow: 



Table 1 Fuzzy certain feature membership 



p{ai = Vjj\dk) 



f{?ij n d k ) + 1 
/(4) + M 



(l) 



where {v^} is the set of values for feature a i} fiy^ n d k ) 
is the frequency count of instances in class d k having a t 
- Vij , f(d k ) is the number of instances in the dataset 
belonging to class d k , and |cXi| is the number of values 
possible for feature a t . 



Certain feature value (Graph 
density) 



Fuzzy feature value (Graph density) 



<0.5 



>0.5 



<0.5 



>0.5 



2) Compute optimal FCFM using Genetic Algorithm 
(GA) FCFM is a set of matrices that represent the fea- 
tures as mentioned above [11]. Each feature has one cor- 
responding matrix, and an element in the matrix is the 
membership of a Certain feature value in a Fuzzy value. 
We first create a set of FCFMs. Such FCFMs are referred 
to as the population. The matrix of corresponding FCFM 
is initialized to an identity one in the population. 

The Genetic Algorithm is applied on the FCFM popula- 
tion for calculating the membership of fuzzy feature value 
in the certain feature value. Three basic steps of GA are 
performed in several iterations (see Figure 2). We consider 
each iteration as a generation. The basic steps of selection, 
crossover and mutation are described below: 

a. Selection. Each FCFM gets a score from a fitness 
function. The high ranking ones are selected (see Figure 3). 
The detailed of fitness function is defined in our previous 
paper [11] .We performed n-fold cross validation to obtain 
a set of n accuracy measures from Naive Bayesian model. 
The probability of Fuzzy feature is calculated by the prob- 
ability of Certain feature combining with the given FCFM. 
The mean and standard deviation of the classification 
accuracies is computed for calculating the score of FCFM 
as follow: 



score = mean — standard deviations 



(2) 



b. Crossover. The selected FCFMs are referred to as par- 
ents. The corresponding matrices are randomly combined 
from two parents to generate new members (children). 



Initialization 






Selection 






Termination 



Mutation 



Crossover 



Figure 2 Genetic Algorithm. Tree Basic steps of GA: selection, 
crossover and mutation. 
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Figure 3 Computation of Fitness score. Steps of calculating 
Fitness score. 



Then the FCFMs who were not selected are replaced by 
child FCFMs. The instance of crossover operations of 
FCFM is shown in Figure 4. Two attributes density and 
mean degree of candidate protein complexes have Fuzzy 
feature value in this task. 

c. Mutation. An element of FCFM matrix is randomly 
selected and altered to a random value in the interval 
[0, 1]. The other elements in the same row also need to 
change by calculating for maintaining the orthogonality 



of the matrix. The number of mutations performed is 
determined by a parametric study. Also take the feature 
density and mean degree for examples, Figure 5 illus- 
trates the mutation operations on the FCFM. 

In summary, new population members are generated 
after each iteration. Figure 6 described the instances 
generated new population in the second iteration of GA. 
Finally, the scores of the population members converge 
to a constant value after some iterations. The FCFM 
with the highest score is selected for computing prob- 
abilities of fuzzy events. 

3) Compute probabilities of fuzzy events The member- 
ship value of Certain feature values v ix in Fuzzy feature 
value is denoted as [i^ v ix [9,10]. Conditional probabil- 
ities of fuzzy-feature values (v^) for particular classes 
(<i/ c ), are calculated as, 

P ( a = v |4 ) = P (at = Vix\dk)fi u Vix (3) 

\ i ij / x ij 

Marginal probability of fuzzy feature value Vy is, 
r(v)=J2 P(Vix)^Vix (4) 

\ ij/ x ij 

4) Inference When the feature of instances (candidate 
protein complexes) is fuzzy (I), the posterior probability 
for a class of this instance {d k ) is calculated using, 



Flu eiP{v\dk) 



P(d k \I) = P(4)- 



P(l) 



(5) 



Since the denominator P(I), is common for all candi- 
date protein complexes, it is dropped. Note that the 
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Figure 4 Crossover operation for the example FCFM. 
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Figure 5 Mutation operation for the example FCFM. This has been implemented as a two steps process. In the first step a), a parameter is 
randomly selected and its value is randomly changed to a value in the interval [0, 1]. In the second step b), the parameters in the row of the 
parameter changed in the previous step, are divided by their row sum, in order to maintain their sum as equal to 1. 



posterior probability is directly proportional to the label 
score. 

P{d k \I_)cxP(d k )Yl Mel P{v |4) (6) 

u i) 



datasets. The framework of our study is illustrated in 
Figure 7. Next, we combined GAFNB model with the 
state-of-the-art methods of protein complexes detection 
(CMC and COACH) to illustrate the utility of GAFNB 
filter. 



Experiments 

We plug in the GAFNB model as a filter in the process 
of protein complexes detection for improving the per- 
formances. For evaluating ability of dealing with fuzzy 
feature value, we first compared GAFNB model with 
Naive Bayes on two datasets. One ten-fold cross valida- 
tion run was performed for both models on each of the 



Experimental data 

We downloaded yeast protein interaction data from DIP 
[13] with a total of 17,201 PPI pairs. The protein com- 
plex data was downloaded from a public repository 
located at http://www.cs.cmu.edu/~qyj/SuperComplex. 
It consists of 493 protein complexes from MIPS [14] 
and TAP-MS [15] (size >2). 
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Figure 6 Population of FCFM in second iteration of Genetic algorithm. Member 1 has escaped mutation in the first iteration. Since 
members 2 and 5 have high scores they have been selected. 1 and 3 will be deleted and they will be replaced by progeny generated by 
crossover and mutation. 
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Figure 7 Flowchart of GAFNB evaluation. 



Evaluation matrix 

Since the benchmark of protein complexes is not com- 
plete and lots of negative protein complexes in the real 
world, we prefer to get more true protein complexes as 
fast as possible. The Lift [16] measure is a good choice 
for evaluating this task. When ranking the prediction 
score of a classifier, the ratio of known positives in top n 
is called estimated precision (EP). The baseline precision 
(BP) is the ratio of the known positives over the total 
number of samples in the data set. The Lift is defined as 
follows, 



Lift = EP/BP 



(7) 



which shows how fast the classifier obtains positives. 

Besides the Lift value, we also chose the Receiver 
operating characteristic (ROC) curve and the area under 
the ROC curve (AUC) values for evaluating our model 
[17]. ROC curve is a graphical plot which illustrates the 
performance of a binary classifier system as its discrimi- 
nation threshold is varied. 

We followed existing approaches [4,18,19] to evaluate 
the experimental performance of protein complexes 
identification. Equation 8 calculates the neighborhood 
affinity score NA(p,b) between a predicted cluster pe P 
and a real complex be B, where P is the set of predicted 
complexes by a computational method and B is the set 
of real ones in the benchmark. 



NA(p, b) 



\v P nv b \ 2 
|v p | x mi 



In equation 2, |V P | is the number of proteins in the 
predicted complexes and |V b | is the number of proteins 
in the real complex. If NA(p,b)>co, a real complex and a 
predicted complex are considered to be matching (<d is 
usually set as 0.20 or 0.25) [2]. 

After all real complexes and predicted clusters have 
their best match calculated according to their NA 
scores, precision, recall, and F-measure are applied to 
assess the methods: 



N cp = \{p\pep,3beB,NA{p,b) > co}\ 
N ch = \{b\b e B,3p e P,NA{p,b) > co} | 

Precision-- ^Recall --^ 



(9) 
(10) 

(11) 



(8) 



F = 2 x Precision x Recall/ (Precision + Recall) (12) 

N cp is the number of predicted complexes that match 
at least 1 real complex, and N cb is the number of real 
complexes that match at least 1 predicted complex [2]. 



GAFNB performance evaluation 

For training our model, we need the negative data. How- 
ever, it is rare to find a confirmed report of non-protein 
complexes. Hence, we randomly selected proteins in the 
PPI network for generating negative protein complexes. 
We evaluate our model on two datasets: a balance data- 
set, containing 493 positive and 493 negative protein 
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complexes, and an unbalance dataset having 493 positive 
and 9,878 negative protein complexes. The ratio of posi- 
tives to negatives is 20:1. It represents the real life sce- 
nario where positive protein complexes are very rare. 

Following a previous study [6], each protein complex is 
represented by a 295- dimensional feature vector. These 
include 29 graph features and 266 biological property 
based features. The graph features are extracted from the 
subgraph which formed by constituent proteins in the 
sample. The biological properties include biochemical 
properties, protein length and physicochemical proper- 
ties. Biochemical properties include amino acid composi- 
tions and secondary structure, while physicochemical 
properties include hydrophobicity, normalized van der 
Waals volume, polarity, polarizability and solvent accessi- 
bility(for details, please see ref [6]). Let a complex con- 
sists of n proteins, the mean and maximum biological 
feature values of n proteins are taken as corresponding 
complex feature values. 

Feature selection is the process of selecting a subset of 
relevant features for use in model construction. It can 
improve model interpretability, shorten training time 
and enhance generalisation by reducing overfitting. 
There are two common categories of feature selection 
algorithms: filters and wrapper. Filter methods produce 
a feature set which is not specific type of predictive 



model, such as information gain, chi-square test. It eval- 
uates each feature individually. While Wrapper methods 
usually provide the best performing feature set for that 
particular type of model. Since the model is fixed in our 
study, we chose wrapper method to select features. 
However, it is very computationally intensive. If the 
number of features is n, the number of possible feature 
sets is 2 n . Hence many popular search approaches use 
greedy hill climbing and best first, which iteratively eval- 
uates a candidate subset of features, then modifies the 
subset and evaluates if the new subset is an improvement 
over the old [20]. In our study, we used WEKA's Wrap- 
per selection [21] to find a proper feature subset for 
Naive Bayes model. It started from the empty set of 
features and used a forward best first search with a stop- 
ping criterion as five consecutive fully expanded non- 
improving. 

Protein complexes identification 

We chose 2 different state-of-the-art methods to get 
candidate protein complexes for GAFNB filtering. CMC 
and COACH are implemented on DIP network. For 
COACH, the argument was set to 0.225, as mentioned 
in their paper as mentioned in their paper [4]. CMC is 
implemented on a revised and weighted network by 
AdjustCD, the top 10000 PPI pairs are selected and the 
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Figure 8 Overlap of selected features subset in balance dataset. 
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Figure 9 Overlap of selected features subset in unbalance dataset. 



two arguments were both set to 0.25. We chose a balance 
dataset, containing 493 positive and 493 negative protein 
complexes for our experiment. We selected One five-fold 
cross validation run was carried out for GAFNB model. 
Each time four-fold is for training model and one fold is 
for testing final performance of protein complexes. In the 
evaluation of protein complexes identification, we first 
filtered out complexes whose NA score are above 0.5 in 
matching the four fold complexes, then calculate the 
performance of the 1 fold data. The average performance 
is calculated for comparison. 

Results and discussion 

GAFNB performance evaluation 
A. Feature selection 

The dataset is randomly split into 10-sets. Each set is 
selected in turn as the test set and the remaining sets 
are combined to form the training set for WEKA Wrapper 
algorithm. Hence, we had ten optimal features sets based 
on Naive Bayes classifiers for each dataset (balance dataset 
and unbalance dataset). Figure 8 shows the overlap of ten 
optimal feature subsets based on balance dataset. Only the 
feature weight edge variance with missing edge is selected 
for 10 times. The similar results in unbalance data as 
shown in Figure 9. Three Features (weight edge mean with 
missing edge, topological change 0.3 OA and degree max) 
are selected for 10 times. All this features that selected for 



ten times are graph topological features. While biological 
property features are different in each run. This is prob- 
ably because some of them are correlated and sensitive to 
training data. 
B. Evaluation results 

Ten-fold cross validation was run for evaluating the per- 
formance of GAFNB model. Table 2 show the Lift values 
of NB, GAFNB with 10, 30 and 50 generations on balance 
dataset. The Lift value of GAFNB with 50 generations is 
always higher than that for NB on balance dataset. 
GAFNB-50 got 2.01 Lift value in top 20, it significantly 
improved than NB. The more trained generations, the 
larger Lift value is. The reason is probably that the models 
given more generations are more optimized. The similar 
results are also obtained in unbalance dataset (Table 3). 
When the number of generations is above 10, the per- 
formance of GAFNB is always better than NB. We also 

Table 2 Lift value of GAFNB and NB in small balance 
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Table 3 Lift value of GAFNB and NB in large unbalance 
data 





NB 


GAFNB GE = 10 


GAFNB GE = 30 


GAFNB GE = 50 


Top 10 


15.63 


16.23 


16.63 


16.63 


Top 20 


14.23 


14.23 
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Top 30 
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13.36 
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11.22 
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10.5 
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9.12 


9.55 


9.39 
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8.16 


8.33 
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Top 80 


7.54 


7.52 


7.79 


7.97 


Top 90 


7.04 


6.92 


7.19 


7.28 


Top 100 


6.47 


6.53 
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6.93 



evaluated our model using AUC measure (Table 4). The 
AUC of GAFNB is also better than that of NB both 
in balance and unbalance data. Increasing with the num- 
ber of generations, the AUC of GAFNB gets better in 
balance data. While it is different for unbalance data, 
the AUC of GAFNB-30 is greater than that of GAFNB-50. 
It is possibly because the model was over fitted or a 
particular cross-validation set might have been localized 
to a local minimum. The ROC curve also reflected the 
better performance of GAFNB as shown in Figure 10. 
In the unbalance data, ROC curve of GAFNB is always 
higher than NB. While in the balance data, the true 
positive rate (TPR) of GAFNB is above that of NB 



Table 4 AUC of NB and GAFNB in balance data and 



unbalance data 



AUC 


NB 


GAFNB 


GAFNB 


GAFNB 






GE = 10 


GE = 30 


GE = 50 


Balance data 


0.9131 


0.9165 


0.9192 


0.9202 


Unbalance data 


0.8998 


0.9057 


0.9127 


0.9120 



when false positive rate (FPR) > 0.1. In summarize, all 
this indicates that performance of GAFNB is better 
than NB. A fuzzy model is more suitable when unrelia- 
bility is present. 

Protein complex identification 

The comparison results show that the precision and 
F-value for COACH increased from 0.3387 to 0.4852 
and 0.4465 to 0.5574, respectively, when filtering its 
results with GAFNB model over original ones (Table 5, 
Figure 11). However, there was a slight decrease in the 
recall (from 0.6551 to 0.6548). The generation value is 
chose 30 here. The feature subset was selected by 
WEKA based on balance data as mentioned in Figure 8. 
After filtering results of CMC with GAFNB model, the 
precision and F-value of CMC increased from 0.4055 to 
0.496 and 0.4984 to 0.5611, respectively. There was also 
a slight decrease in the recall (from 0.6466 to 0.6459) 
when using GAFNB filtering (Table 5, Figure 12). 
Generally, the Precision and F- measure increased when 
filtering them with GAFNB model. It indicates that 
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Table 5 Performance Comparison CMC, CMC+GAFNB, 
COACH and COACH+GAFNB 





Precision 


Recall 


F-measure 


COACH 


0.338688 


0.655172 


0.44654 


COACH+GAFNB 


0.48524 


0.654792 


0.557408 


CMC 


0.405479 


0.646552 


0.498395 


CMC+GAFNB 


0.495997 


0.645936 


0.561122 



GAFNB model can filter out some false positives. Since 
existing protein complexes identification methods only 
consider graph structure of PPI network, GAFNB model 
incorporate many biology features to filter out false 
positives. While CMC predicted only 365 candidate pro- 
tein complexes, filtering out some complexes can harm 
the recall. The more candidate protein complexes from 



existed identification methods, the better performance 
of GAFNB filter has. 

Evaluation of the contribution of each type feature 
toward protein complex identification will be done in 
the future. Some of our predicted complexes do not 
match any complex in the benchmark complex set. We 
found that the predicted complexes have high biologi- 
cal significance, as computed using P value, and high 
local density, as shown in Figure 13. They may be true 
complexes that are as yet undiscovered. The P values 
were calculated with the SGD's GO::TermFinder [22]. 
A low P value of a predicted complex generally indi- 
cates that the collective occurrence of these proteins in 
a complex does not occur merely by chance, and thus 
the predicted complex has a high statistical probability 
of being real. 
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False Positive Complexes 


P-Value 


Graph 


Co Function 


Go-Process 


Go-Component 


YDR388W YCR009C 
YAL005C 


2.93E-05 


0.000870526 


0.0751358 


( YAL005C 
(yCR009C) 


-[yOR388W] 


YKL196C YOR036W 
YDR264C YMR197C 


1.97E-06 


4.47E-06 


0.000381231 




)36wj 

^6C^ 




264C) 


YOR136W YIL094C 
YPL061W YJL130C 
YER110C 


0.000430584 


5.84E-05 


0.00353628 


(yeriioc) 

/\ jYPL061Wj 



Figure 13 Some false positive complexes with low P-Value. 



Conclusions 

In this paper, we plug a GAFNB model in the process of 
protein complexes detection. The candidate protein 
complexes from existed methods are filtered by our 
GAFNB model. Results show that the performance of 
protein complexes identification methods are improved 
using our GAFNB filter and the GAFNB model is more 
suitable when unreliability is present. In the future, we 
will apply our genetic algorithm fuzzy Naive Bayes model 
to deal with the other task with fuzzy features. 
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